Skunkware 5

home *** CD-ROM | disk | FTP | other *** search

/ Skunkware 5 / Skunkware 5.iso / src / X11 / wais / ir / DESIGN < prev next >

Wrap

Text File | 1995-05-09 | 4KB | 96 lines

Serial Indexer Overview Brewster Kahle The serial indexer is a simple inverted file system that is not very different from existing IR systems. DATABASE FILES The serial indexer parses files and creates an inverted file index made up of 7 files. For a database named "index" the files would be: * index.inv -- the "postings file" that is a term followed by a list of entries each of which describe where that word occurs in the original files. A posting is a weight, doc_number (see the doc file), and character_position. This file is indexed with the dictionary (dct) file. The terms are in alphabetical order. * index.doc -- this is a linear list of document-entries one for each document. A document can be a complete file or a piece of a file (such as mail files that are the concatentation of many messages, each message is a document). The information kept in each entry is: filename_id: position into the filename file (fn) of the filename for this document. headline_id: position in the headline file. start_character: position in the file where this document starts end_character: end position. 0 if complete file. document_length: in characters number_of_lines date: time_t * index.fn -- list of the filenames in the database with the write-date of the file and the type of the file. Type is a string. Indexed by the position in the file, so this file can not be edited after the index is built. * index.hl -- list of the headlines. Indexed by position. * index.dct -- dictionary file which is a 2 level b-tree. The first block is pointers to the every 1000th entry in the rest of the dictionary file. Each entry is a fixed-length record of the word with the position into the rest of the file. The rest of the file are blocks just like the first block, but each entry is the word plus the position of it in the inverted file (inv). The whole dictionary is in alphabetical order. * index.src -- source description that is used to access the database. It is also returned as a response to the "help" query. This file is not overwritten once it is created. Therefore database maintainers should edit this file to add a good desciption of what that database contains. * index.status -- only the ram based seeker uses this to describe itself and get parameters from the user. INDEXING A new index is built by parsing the input files, finding the words in it and creating the filename, headline, and doc tables at parse time. The words are then passed to routines defined in irext.h which define the frontend/backend boundary. The serial indexer creates intermediate inv files starting at inv0 then inv1 etc. These are created by accumulating the words in a memory hashtable. Each invN file is in alphabetical order, so they can be merged easily into one inv file. This merging is done logarithmically, so that it is fast. Unfortunately, it means that just before the merging is done, the index occupies twice the space that it will finally. On the final merge, a dictionary file is produced by dumping the positions of the start of terms in the final inverted file. Adding to an exising database is done by adding to the filename, headline, and doc tables as the parsing is done, but waiting to merge the new invN files into the old inv file until all the new merging is done. Theoretically it should be able to be done while always having the database usable. SEARCHING A query is parsed for its seed words and relevant documents and is passed to a function search_word (defined in irext.h). The backend does whatever it will do to search and set some state. Then best_hit is called iteratively to find the documents that most closely matched the query. These results are used to look up headlines and filenames to hand back to the user. The serial backend does this by loading the postings for a particular term and then adding it into a score array. The score array has an entry for each document. A document gets its score increased by containing terms in the query. Roughly the goals of this design were: Make a flexible platform for experimentation Portable Low search overhead for fast startup and lightweight access Usable in pieces for multiple search engines backends Fast searching Easy to implement Size of the resulting database was not a priority